Add some unit tests for sled-agent Instance creation #4489
Closed
Depends on #4325 for faking zone creation.
At time of writing, instance creation roughly looks like:
- instance_put_state
  - InstanceManager::ensure_state
    - Instance::propolis_ensure
      - cpapi_instances_put (if not migrating)
      - Instance::setup_propolis_locked (blocking!)
        - RunningZone::install and Zones::boot
        - illumos_utils::svc::wait_for_service
        - self::wait_for_http_server for propolis-server itself
      - Instance::ensure_propolis_and_tasks
        - Instance::monitor_state_task
          - cpapi_instances_put (if not migrating)
- handle_instance_put_result
Or at least, it does in the happy path. #3927 saw propolis zone
creation take longer than the minute allowed for nexus's call to
sled-agent's instance_put_state. That might've looked something like:
- instance_put_state
  - InstanceManager::ensure_state
    - Instance::propolis_ensure
      - cpapi_instances_put (if not migrating)
      - Instance::setup_propolis_locked (blocking!)
        - RunningZone::install and Zones::boot
- handle_instance_put_result
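
To make the failure mode above concrete, here is a minimal std-only sketch of a caller that gives up on a blocking handler after a fixed deadline. Everything in it (`CallOutcome`, `call_with_timeout`, the closure standing in for the handler) is hypothetical and only illustrates the shape of the problem; it is not how nexus or Dropshot actually implement the timeout.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical view of the call from the caller's side.
#[derive(Debug, PartialEq)]
enum CallOutcome {
    Completed,
    TimedOut,
    HandlerDied,
}

/// Runs `blocking_handler` (a stand-in for a blocking request handler)
/// on another thread and waits at most `client_timeout` for it to finish.
fn call_with_timeout<F>(blocking_handler: F, client_timeout: Duration) -> CallOutcome
where
    F: FnOnce() + Send + 'static,
{
    let (resp_tx, resp_rx) = mpsc::channel();
    thread::spawn(move || {
        blocking_handler();
        // The caller may already have given up; ignore a closed channel.
        let _ = resp_tx.send(());
    });
    match resp_rx.recv_timeout(client_timeout) {
        Ok(()) => CallOutcome::Completed,
        // The handler keeps running in the background after this point.
        Err(RecvTimeoutError::Timeout) => CallOutcome::TimedOut,
        // The handler thread exited (e.g. panicked) without responding.
        Err(RecvTimeoutError::Disconnected) => CallOutcome::HandlerDied,
    }
}
```

Note that in the `TimedOut` case the handler is not cancelled: the work may still finish later, which is part of why the caller's view and the callee's view can disagree.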
To avoid this timeout being implicit at the Dropshot configuration
layer (that is to say, we should still have some timeout),
we could consider a small refactor to make instance_put_state
not a blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band cpapi_instances_put calls! That might look
something like:
- instance_put_state
  - InstanceManager::ensure_state
    - Instance::propolis_ensure
      - cpapi_instances_put (if not migrating)
      - Instance::setup_propolis_locked (blocking!)
      - Instance::ensure_propolis_and_tasks
        - Instance::monitor_state_task
          - cpapi_instances_put (if not migrating)
- handle_instance_put_result, which nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)
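
As a rough illustration of the non-blocking shape described above, here is a std-only sketch in which the handler returns immediately, the slow setup runs in the background under an explicit deadline, and the result is reported over a channel. The names (`Report`, `put_state_nonblocking`, `to_nexus`) are hypothetical, threads stand in for whatever async tasks sled-agent would really use, and the channel is only a stand-in for the out-of-band cpapi_instances_put call.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical status pushed back to nexus out of band.
#[derive(Debug, PartialEq)]
enum Report {
    Running,
    Failed(String),
}

/// Starts `setup` (a stand-in for the blocking zone setup) in the
/// background and returns right away. The outcome, including an explicit
/// timeout, is sent on `to_nexus` later.
fn put_state_nonblocking<F>(setup: F, timeout: Duration, to_nexus: mpsc::Sender<Report>)
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        let (done_tx, done_rx) = mpsc::channel();
        // Run the slow part on its own thread so a deadline can be enforced.
        thread::spawn(move || {
            setup();
            // The supervisor may have timed out already; ignore send errors.
            let _ = done_tx.send(());
        });
        let report = match done_rx.recv_timeout(timeout) {
            Ok(()) => Report::Running,
            Err(RecvTimeoutError::Timeout) => {
                Report::Failed("setup timed out".to_string())
            }
            Err(RecvTimeoutError::Disconnected) => {
                Report::Failed("setup task died".to_string())
            }
        };
        // Nexus may have stopped listening; nothing more to do in that case.
        let _ = to_nexus.send(report);
    });
}
```

The point of the sketch is only that the timeout becomes an explicit value owned by sled-agent rather than a side effect of the HTTP layer. Cancellation by ID and the shorter logging threshold are left out.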
Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.
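
For readers unfamiliar with the approach, the general idea of testing against faked zone creation can be sketched as below. This is an assumption-laden illustration, not the mechanism #4325 actually provides: the `ZoneInstaller` trait, `FakeInstaller`, `setup_instance`, and the `zone-{id}` naming are all made up for the example.

```rust
use std::cell::RefCell;

/// Hypothetical seam over zone creation so tests can substitute a fake.
trait ZoneInstaller {
    fn install_and_boot(&self, zone_name: &str) -> Result<(), String>;
}

/// Fake that records requests instead of touching the real system.
struct FakeInstaller {
    booted: RefCell<Vec<String>>,
    fail: bool,
}

impl ZoneInstaller for FakeInstaller {
    fn install_and_boot(&self, zone_name: &str) -> Result<(), String> {
        if self.fail {
            return Err(format!("cannot boot {zone_name}"));
        }
        self.booted.borrow_mut().push(zone_name.to_string());
        Ok(())
    }
}

/// Hypothetical stand-in for the instance setup path under test.
fn setup_instance(installer: &dyn ZoneInstaller, id: u32) -> Result<String, String> {
    let zone = format!("zone-{id}");
    // Propagate installer failures to the caller unchanged.
    installer.install_and_boot(&zone)?;
    Ok(zone)
}
```

A test can then drive `setup_instance` with a succeeding or failing `FakeInstaller` and assert on both the returned value and the recorded zone names, without needing a real zone to boot.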